Welcome to my spot on the web for drafts, supplemental material, and general thoughts about doing a thesis project
for the Master of Science in Predictive Analytics degree (now the Master's in Data Science (MSDS) program)
from Northwestern University. Below the interactive plots, I'm developing a sort of "epilogue" containing
thoughts about doing a data science Master's, choosing the thesis option, and some of the things I've learned
along the way.
Thesis Paper
I'll update this section with drafts as they get finished.
2018-11-04: I have a (mostly) completed draft you can check out here on Google Drive.
I'm currently awaiting comments from readers, so no doubt it will change substantially. I haven't put in
a Table of Contents and I'm still figuring out how to list the supplemental materials you'll find on this page,
but everything else is there (hooray!).
2018-12-16: A lot has changed in the last month or so! I've decided to push back my tentative graduation
date from this month to the end of the 2019 Winter Quarter, in part due to starting a new position as a data
scientist at Highmark Health here in Pittsburgh. I had
the thesis draft reviewed by my first reader, who suggested some restructuring for the Conclusions section but
otherwise found it to be good.
I spent a few weeks away from the thesis, which allowed me to come back to it with a fresh set of eyes. I made
some grammar edits and added the Table of Contents as well as the Appendix listing the supplemental material
(links to the GitHub repo and this webpage).
The most recent version is v.4.0 which can be accessed here.
This is a completely formatted draft with all the necessary components as outlined in the Graduate Thesis Handbook.
I'm happy to have some time to finish the process in a way that isn't rushed. I'll be working over the holidays
to restructure the Conclusions section and hope to get notes from a second reader by the end of January. Barring
any substantial unforeseen issues, I should have everything done by the March 15th deadline to graduate at the end
of the Winter 2019 quarter (hooray!).
Below are four interactive multidimensional scaling plots of genetic profiles developed from open-source RNA-seq
data available from the Aging, Dementia, and TBI Study
from the Allen Institute for Brain Science.
Use your mouse to grab them, rotate them, and zoom in and out. Hovering over a data point gives the point's coordinates in the first three MDS dimensions. Each point
represents a genetic profile (based on expression levels for 50,000+ genes and gene isoforms) for an individual patient/donor.
These were made using Plotly and htmlwidgets
for R. Check out this blog post
for more on multidimensional scaling of gene expression level data.
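For a rough idea of what multidimensional scaling does with expression data, here is a minimal sketch of classical (Torgerson) MDS in Python with NumPy. It is not the thesis code: the plots on this page were made in R, the expression matrix below is random toy data, and the function names are made up for illustration. It also assumes plain Euclidean distances between profiles, which may differ from the distance actually used for the plots.

```python
import numpy as np

def classical_mds(distances, n_components=3):
    """Embed points in n_components dimensions from a square distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to recover a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by floating-point error.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

def pairwise_euclidean(x):
    """Return the matrix of Euclidean distances between the rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
# Hypothetical toy data: 8 donors x 50 expression values (not real RNA-seq data).
expression = rng.normal(size=(8, 50))
distances = pairwise_euclidean(expression)

# One row per donor, one column per MDS dimension (what the 3D plots display).
coords = classical_mds(distances, n_components=3)
print(coords.shape)
```

Each row of coords is one donor's position in the first three MDS dimensions, which is the kind of coordinate triple shown when hovering over a point in the plots.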
Shaded by Brain Region
HIP = hippocampus
FWM = forebrain white matter
PCx = parietal cortex
TCx = temporal cortex
Shaded by Donor Sex
Shaded by Lifetime Number of Traumatic Brain Injuries (TBIs)
A comparison of the numbers of "significant" genes obtained with different filtering parameters and
p-value cutoffs for determining differential expression in donors with dementia.
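The kind of comparison described here can be sketched as a small grid of counts. This is only an illustration under stated assumptions: the gene names, expression values, p-values, filter thresholds, and cutoffs below are all invented, and the thesis analysis itself may have used different filters and tests.

```python
from itertools import product

# Hypothetical per-gene results: (gene, mean expression, p-value).
results = [
    ("GENE_A", 12.0, 0.0004),
    ("GENE_B", 0.4, 0.0090),
    ("GENE_C", 3.5, 0.0300),
    ("GENE_D", 8.1, 0.0480),
    ("GENE_E", 0.9, 0.2000),
]

def count_significant(rows, min_expression, p_cutoff):
    """Count genes that pass the expression filter and fall below the p-value cutoff."""
    return sum(1 for _, expr, p in rows if expr >= min_expression and p < p_cutoff)

min_expression_levels = [0.0, 1.0, 5.0]  # assumed filtering thresholds
p_cutoffs = [0.05, 0.01, 0.001]          # assumed significance cutoffs

# One count per (filter, cutoff) combination.
counts = {
    (m, p): count_significant(results, m, p)
    for m, p in product(min_expression_levels, p_cutoffs)
}

for (m, p), n in counts.items():
    print(f"min expression >= {m}, p < {p}: {n} genes")
```

Stricter filters and smaller cutoffs can only shrink the list, so a table like this makes it easy to see how sensitive the number of "significant" genes is to those choices.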
As a part of the exploratory analysis of the RNA-seq transcriptome data, I investigated the 29 genes that had
altered expression patterns in all four brain regions sampled from donors with dementia
(hippocampus, forebrain white matter, parietal cortex, and temporal cortex).
Things I've Learned by Doing a Data Science Master's Thesis
As things start to wrap up for me, I'm finding myself reflecting on the entire experience of doing the MSPA program.
Maybe you stumbled onto this page because you're thinking of pursuing a data science Master's
degree. Or maybe you're already in the MSDS program at Northwestern or somewhere else and are trying to
make the "thesis or capstone" decision. In this section, I'll be keeping a list of some of the things
I've learned from doing this degree with a focus on doing a thesis project. Just my $0.02. FWIW, etc. I'm
putting it down here as a sort of epilogue to the thesis once she's all done.
“Life can only be understood backwards; but it must be lived forwards.” - Kierkegaard
Doing this program was a great decision for me. As someone moving from academia to industry
AND changing careers, the ability to talk to and learn from people already doing data science in
a variety of industries was exactly what I needed. Classes were challenging and I appreciated the flexibility
of an entirely online program. My classmates are incredible people. I learned so much from interacting with
them and our instructors as well. The structure of an actual academic program was good for me because it
kept me on track and provided me with a level of accountability, ensuring that I was learning what I needed to
learn. Your mileage may vary but, for me, it was well worth the investment I made (more specifics on that soon 😉).
You get out of it what you put into it. Most educational experiences are like this, I bet. I'm
not saying anything here you probably don't already know. I figure if you (or your company) are going to
drop a lot of coin on a program like this, why not go the extra mile, if you can? Show up. Be creative. And don't be
afraid to come in last in your class in a Kaggle InClass competition or bomb a technical interview in truly
spectacular fashion 😉 It just might be the best thing that ever happens to you.
Got time and an idea? Not doing data science for a living yet? Do the thesis. I have learned
more in the past year of self-study for the thesis project than I ever thought I would. I grok
more about statistics, clustering, penalized linear models, binary classifiers, and so much more now for
having done this thing. I feel like I can talk about those things and be confident in what I'm saying.
Personally, I learn best by doing, screwing it up, doing it over, screwing it up some more, etc. If you're
not doing data science for a living yet, and don't have the opportunity to work with real data on the reg, the
thesis project can be a terrific way to get an understanding beyond the Titanic and MNIST.
BUT, doing the thesis will take a long time. Maybe not for some people, but on average it does
take longer than a quarter. Maybe two. I gave myself a year to do it with everything else going on in my life
and it could take longer. But for me it is worth it. You'll have to weigh the options for yourself.
The Northwestern MSDS Canvas site has resources to help you decide if doing a thesis is for you. Also,
Dr. Alianna Maren has a very honest flowchart
for making the decision that you can check out. Be prepared to work independently but don't pass up
opportunities to use University resources like The Writing Place.
You/your company are, after all, paying for them 😊
If you have the time, blog about your journey. I'm writing this in HTML right now, something I
never thought I'd learn over the course of doing the program/pivoting to a career in data science.
I'm so glad I discovered GitHub Pages and set up this website because I've learned so much bonus stuff
in the process. A little web design. A little CSS and HTML. Even a little Ruby. It's a place to showcase
your work and maybe (hopefully!) interact with others. And speaking of GitHub...
GitHub is amazing. Get an account, if only to share code with your classmates, but it is so much
more than that. Software developers moving to data science already know about this thing but I had no idea.
I had an epiphany back in about April/May 2018 when I learned a little about how GitHub is actually
used to organize projects and stuff. I mean...
I changed all the scripts I had written for the thesis project up to that point so that, when I finished
the project, anybody could clone the project repository, run the scripts in order, and reproduce any of the
results I put in my thesis or on this blog. I mean, wow. It blew my mind when I discovered that. Open source is
amazeballs. Many of our favorite R and Python packages are developed there and we can be a part of them.
How cool! I have the zeal of the converted but seriously. Make use of GitHub. Especially if you're doing a
thesis project that you can share with others/point potential employers towards if you're on the hunt for
a new gig.
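One way to make "clone the repository and run the scripts in order" concrete is a small runner like the sketch below. It is an assumption-laden illustration, not how the thesis repository is actually organized: it supposes the scripts are Python files named with a numeric prefix (for example 01_download.py, 02_clean.py), and both the naming convention and the function names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

def ordered_scripts(script_dir):
    """Return the Python scripts in script_dir sorted by their numeric prefix."""
    scripts = [
        p for p in Path(script_dir).glob("*.py")
        if p.name.split("_")[0].isdecimal()  # skip files without a numeric prefix
    ]
    return sorted(scripts, key=lambda p: int(p.name.split("_")[0]))

def run_all(script_dir):
    """Run each script in order, stopping at the first one that fails."""
    for script in ordered_scripts(script_dir):
        print(f"Running {script.name}")
        # check=True raises CalledProcessError if a script exits with an error.
        subprocess.run([sys.executable, str(script)], check=True)

# Example usage (hypothetical directory name):
# run_all("scripts")
```

Sorting on the integer value of the prefix rather than on the file name keeps 10_plot.py after 2_clean.py, which a plain alphabetical sort would get wrong.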
For writing up, use EndNote for referencing and format the entire document as you go. If you're a student,
chances are you can get a free copy of EndNote, a software package that will help keep your
references organized and that will automatically construct a bibliography for you. If you're a Northwestern student,
you can get a copy of EndNote free from IT here.
Do yourself a huge solid and learn about the 'cite-while-you-write' feature. The Graduate Thesis Handbook
(links to an older copy) suggests using APA or Chicago style but you can use any citation format so long as it's consistent.
Read over the formatting requirements and build them right into your document from the start. It saves a lot of
time. One thing I learned was that Word will write a Table of Contents
for you automatically if you apply its heading styles. I had no idea! It makes a much nicer ToC than doing it by hand.